Banking Churn Prediction

The objective of this project is to develop a binary classification model that predicts whether a customer of a European retail bank will churn. I conducted the project as part of a Kaggle competition.

Data Manipulation

Exploratory Data Analysis (EDA)

Churning Statistics

From this chart, one can see that 20% of the customers represented in the dataframe have churned and 80% have not.
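The class split described above can be computed directly from the target column. The following is a minimal sketch on toy data; the column name `Exited` (1 = churned, 0 = stayed) is an assumption and may differ from the project's actual dataframe.

```python
import pandas as pd

# Toy stand-in for the project's dataframe. "Exited" is an assumed
# column name (1 = churned, 0 = stayed), not taken from the source.
df = pd.DataFrame({"Exited": [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]})

# Share of each class as a percentage of all customers.
churn_share = df["Exited"].value_counts(normalize=True).mul(100).round(1)
print(churn_share)
```

A split this uneven is what makes plain accuracy a weak metric for this problem.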

Distribution of Categorical Variables

The graph above shows that, among churned customers, those located in Germany account for the largest share at 40%, followed by France with 39.8% and Spain with 20.3%. Among non-churned customers, France leads with 52.8%, followed by Spain with 25.9% and Germany with 21.3%.

The output above shows that, among churned customers, 55.9% are female and 44.1% are male. Among non-churned customers, 57.3% are male and 42.7% are female.

The graph above shows that, among churned customers, those who use one product make up by far the largest share at $69.2\%$, followed by those who use two products with $17.1\%$, three products with $10.8\%$, and four products with $2.95\%$. Among non-churned customers, $53.3\%$ use two products, $46.2\%$ use one product, and $0.58\%$ use three products.

The output above shows that, among churned customers, 69.9% possess a card and 30.1% do not. Among non-churned customers, 70.7% possess a card and 29.3% do not.

The output above shows that, among churned customers, inactive members account for the larger share at 63.9%, against 36.1% for active members. Among non-churned customers, active members lead with 55.5%, against 44.5% for inactive members.
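The percentages in this section describe the composition of each churn group (each group sums to 100%), which is different from the churn rate within a category. A minimal sketch of both views on toy data follows; the column names `Geography` and `Exited` are assumptions, and the toy numbers do not reproduce the figures above.

```python
import pandas as pd

# Toy data; "Geography" and "Exited" are assumed column names.
df = pd.DataFrame({
    "Geography": ["Germany", "France", "Spain", "France",
                  "Germany", "France", "Spain", "France"],
    "Exited":    [1, 1, 0, 0, 1, 0, 0, 0],
})

# Composition of each churn group: every row (Exited = 0 or 1) sums to 100%.
composition = (
    pd.crosstab(df["Exited"], df["Geography"], normalize="index")
    .mul(100)
    .round(1)
)

# Churn rate per country: share of each country's customers who churned.
churn_rate = df.groupby("Geography")["Exited"].mean().mul(100).round(1)

print(composition)
print(churn_rate)
```

The same pattern applies to the other categorical variables by swapping the second column.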

Distribution of Continuous Variables

The graph above shows that customers aged 46 account for the most churn.

Missing Values

Correlation

Data Preparation

Feature Engineering & Building Base Model

Feature Importance

Building & Training Base Model

Testing Models

ROC-AUC Performance Visualisations

Optimization

Cross-Validation

The dictionary shows how accuracy varies across the cross-validation folds for each model. Each list contains the mean (first number) and the standard deviation (second number) of the accuracy.
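A dictionary of this shape could be built roughly as follows. This is a sketch under assumptions: the synthetic data, the number of folds, and the estimator settings are illustrative and not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the churn data (about 80% / 20%).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Illustrative settings only; the project's real hyperparameters are unknown.
models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=20, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=20, random_state=0),
}

# Map each model name to [mean accuracy, standard deviation] over 5 folds.
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_results[name] = [scores.mean(), scores.std()]

print(cv_results)
```

A small standard deviation relative to the mean suggests the model's accuracy is stable across folds.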

Hyperparameter Tuning

For hyperparameter tuning, we choose the top two performing models: AdaBoost and GradientBoosting.

AdaBoost

The output above shows the optimal parameter values found for AdaBoost.
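The source does not state which search method or parameter grid was used. As one possible approach, a grid search over AdaBoost could look like the sketch below; `GridSearchCV`, the grid values, and the `roc_auc` scoring choice are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the churn data.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Hypothetical grid; the project's actual search space is not given.
param_grid = {"n_estimators": [10, 30], "learning_rate": [0.1, 1.0]}

search = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # assumed metric, matching the one named in the conclusion
    cv=3,
)
search.fit(X, y)

# Best parameter combination and its mean cross-validated score.
print(search.best_params_, search.best_score_)
```

The same pattern applies to GradientBoosting by swapping the estimator and the grid.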

GradientBoosting

Train models with New Best Parameters

Feature engineering

Voting-based ensemble model

With Transformed Data

With Untransformed Data

With transformed data, our voting-based ensemble model performs slightly better.
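A voting-based ensemble of the two tuned models might be assembled as in the following sketch. The choice of soft voting, the estimator settings, and the synthetic data are assumptions; the source does not describe the ensemble's members or the transformation that was applied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    VotingClassifier,
)
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Soft voting averages the predicted class probabilities of the members.
ensemble = VotingClassifier(
    estimators=[
        ("ada", AdaBoostClassifier(n_estimators=30, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=30, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

# Probability of the positive (churn) class, scored with ROC AUC.
proba = ensemble.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print(round(auc, 3))
```

Running the same evaluation once on transformed and once on untransformed features gives the comparison described above.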

Conclusion

In this project I built a model that predicts how likely a customer is to churn. During exploratory data analysis I found that female customers located in Germany and customers who use only one product are the most likely to churn. After building, training and evaluating various models, I chose the best-performing models, GradientBoosting and AdaBoost, and further improved prediction accuracy through hyperparameter tuning, cross-validation and ensembling. Since the problem is a binary classification on an imbalanced dataset, I used the ROC AUC score to evaluate the models' performance, which was 87%. Furthermore, my best model's accuracy was 87%. To further improve the model's performance, I should gather more data for the training set.
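To illustrate why ROC AUC is a reasonable choice on an imbalanced dataset, consider a trivial baseline that predicts "no churn" for everyone. The sketch below uses a toy 80/20 label vector, not the project's data.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels with the same 80% / 20% split as the dataset.
y_true = [0] * 8 + [1] * 2

# Baseline that always predicts "no churn" with a constant score.
baseline_labels = [0] * 10
baseline_scores = [0.0] * 10

acc = accuracy_score(y_true, baseline_labels)  # 0.8: looks good, but is not
auc = roc_auc_score(y_true, baseline_scores)   # 0.5: no ranking ability
print(acc, auc)
```

The baseline reaches 80% accuracy without identifying a single churner, while its ROC AUC of 0.5 shows that it cannot rank customers at all.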